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Abstract 

Background: Clustering-based methods on gene-expression analysis have been shown to be useful in biomedical 
applications such as cancer subtype discovery. Among them, Matrix factorization (MF) is advantageous for 
clustering gene expression patterns from DNA microarray experiments, as it efficiently reduces the dimension of 
gene expression data. Although several MF methods have been proposed for clustering gene expression patterns, 
a systematic evaluation has not been reported yet. 

Results: Here we evaluated the clustering performance of orthogonal and non-orthogonal MFs by a total of nine 
measurements for performance in four gene expression datasets and one well-known dataset for clustering. 
Specifically, we employed a non-orthogonal MF algorithm, BSNMF (Bi-directional Sparse Non-negative Matrix 
Factorization), that applies bi-directional sparseness constraints superimposed on non-negative constraints, 
comprising a few dominantly co-expressed genes and samples together. Non-orthogonal MFs tended to show 
better clustering-quality and prediction-accuracy indices than orthogonal MFs as well as a traditional method, K- 
means. Moreover, BSNMF showed improved performance in these measurements. Non-orthogonal MFs including 
BSNMF showed also good performance in the functional enrichment test using Gene Ontology terms and 
biological pathways. 

Conclusions: In conclusion, the clustering performance of orthogonal and non-orthogonal MFs was appropriately 
evaluated for clustering microarray data by comprehensive measurements. This study showed that non-orthogonal 
MFs have better performance than orthogonal MFs and /(-means for clustering microarray data. 



Background 

DNA microarray can simultaneously measure the 
expression levels of thousands of genes. Increasingly, the 
challenge is to interpret such data to reveal molecular 
biological processes and the mechanism of human dis- 
eases. One of the main goals of expression data analysis 
is to identify the changing and unchanging genes and to 
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correlate these changes with similar expression profiles. 
One of the major challenges for gene expression analysis 
is the reduction of dimension. Gene expression data 
typically have high dimensionality, with tens of thou- 
sands of genes whereas the number of observations or 
experiments is usually under a hundred. Because the 
number of variables easily exceeds that of experiments, 
dimension reduction is obviously required for gene 
expression analysis. This task can be considered as a 
matrix factorization problem. 

Matrix factorization (MF) methods on microarray data 
can extract distinct patterns from the data [1-5]. Princi- 
pal Component Analysis (PCA) and Singular Value 
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Decomposition (SVD) are popular analysis methods, and 
they have been applied to classification problems with 
satisfactory results [1,5]. However, because of the holis- 
tic nature of PCA or SVD, it is difficult to provide the 
biologically instinctive interpretation of data from the 
obtained components. In order to overcome this limita- 
tion, Paatero and Tapper [6] and Lee and Seung [7] pro- 
posed that non-negative matrix factorization (NMF) can 
learn part-based representations that can provide the 
obvious interpretation. The non-negativity constraints 
make the representation purely additive (allowing no 
subtractions), in comparison with many other linear 
representations such as PCA and Independent Compo- 
nent Analysis (ICA) [8]. Their work was applied to sig- 
nal processing and text mining. Brunet et al. [9] applied 
NMF to describe the gene expression profiles of all 
genes in terms of a few number of metagenes in order 
to derive meaningful biological information from cancer 
expression datasets. They clustered the samples into dis- 
tinct subtypes by metagene expression patterns. 

The gene expression patterns can be sparsely encoded 
by metagenes, implying a few significantly co-expressed 
genes. Several groups have proposed NMF formulation 
that enforces the sparseness of the decomposition. Li et 
al. [10] proposed local NMF (LNMF) that has additional 
constraints to enforce the sparseness in the NMF. Hoyer 
[11,12] also proposed NMF formulation that can find 
parts-based representations by explicitly incorporating 
the concept of sparseness. Wang et al. [13] demon- 
strated Fisher non-negative matrix factorization (FNMF) 
that learns localized features by imposing Fisher con- 
straints. Gao and Church [14] attempted to control 
sparseness by penalizing the number of non-zero entries 
unlike other methods. 

Sample-based clustering, however, is not the only con- 
cern in microarray data analysis. Gene-based clustering 
provides informative sets of tightly co-regulated genes. 
While sample-based clustering relies on metagenes, 
gene-based clustering relies on meta-samples. The two 
processes can be viewed as bi-directionally constrained 
with each other. Good metagene may support good 
sample-based clusters and vice versa. Optimizing sam- 
ple- dimension only, sparseness of gene-dimension is 
relatively decreased when sparseness of sample-dimen- 
sion is increased. In result, the minimization problem is 
convex that was subsequently described by others 
[11,12,14,15] and resulting matrix cannot support gene- 
based clusters well. Therefore, optimizing both sample 
and gene dimension together may be appropriated for 
clustering of microarray data. Here, we employed a 
novel non-orthogonal MF algorithm, Bi-directional 
Non-negative Matrix Factorization (BSNMF), with bi- 
directional sparseness constraints superimposed on non- 
negative constraints, comprising a few dominantly co- 



expressed genes and samples together. The bi-direc- 
tional optimization process may provide quality cluster- 
ing with improved biological relevance that may not be 
achieved by applying MFs for each dimension separately. 

Many clustering-based methods are developed to 
transform a large matrix of gene expression levels into a 
more informative set of which genes are highly possible 
to share biological properties. Although clustering-based 
algorithms for microarray data analysis have been exten- 
sively studies, most works have not focused on the sys- 
tematic comparison and validation of clustering results. 

Different algorithms tend to lead to different cluster- 
ing solutions on the same data, while the same algo- 
rithm often leads to different results for different 
parameter settings. Since there is no consensus on 
choosing among them, the applicable measures should 
be applied for assessing the quality of a clustering solu- 
tion in different situations. For example, when the true 
solution is known and we can compare it to another 
solution, Minkowski measure [16] or the Jaccard coeffi- 
cient [17] is applicable. Whereas, when the true solution 
is not known, there is no agreed-upon method for vali- 
dating the quality of a suggested solution. Several meth- 
ods evaluate clustering solutions based on intra-cluster 
homogeneity or inter-cluster separation [18,19]. Mean- 
while, the prediction of the correct number of clusters is 
a basic problem in unsupervised classification problems. 
To solve this problem, a number of cluster validity 
indices, assessing the quality of a clustering partition 
have been proposed. 

In the present paper, we would like to systematically 
evaluate various MFs applied to gene-expression data 
analysis. We compare six MFs, including two orthogonal 
MFs (i.e. PCA and SVD) and four non-orthogonal MFs 
(i.e. ICA, NMF and NMF with sparseness constraints 
(SNMF) and BSNMF) and a well-known unsupervised 
clustering method, K-means algorithm. All were evalu- 
ated by seven cluster-evaluation indices. We evaluated 
them in view of basic three categories: (1) traditional 
clustering, (2) orthogonal MFs and (3) non-orthogonal 
MFs. Predictive power and consistency of the methods 
are evaluated by using adjusted Rand Index and accu- 
racy index when the class labels of data were available. 
To evaluate the biological relevance of the resulting 
clusters from different algorithms, we evaluated the sig- 
nificance of the biological enrichment for the clusters by 
using Gene Ontology (GO) and biological pathway 
annotations. 

Results 

Evaluation of each clustering-based method 

In our study, we applied /Omeans algorithm and six 
MFs, which are two orthogonal (i.e. SVD and PCA) and 
four non-orthogonal (i.e. ICA, NMF, SNMF and 
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BSNMF) algorithms to the five benchmarking datasets. 
We evaluated the seven methods using nine measures, 
including seven cluster evaluation indices and two pre- 
diction power measures. Fig. 1 exhibits results from the 
seven cluster-quality measures. We repeatedly applied 
the clustering (or MFs) algorithms 20 times for each 
dataset for each number of clusters, i.e. K = 2 to 4 (for 
the Iris dataset) or 2 to 5 (for the rest). The values in 
Fig.l represent the averages. 



Among measures, the GAP statistic is optimized when 
it decreases (Fig. 1(g)), while others are optimized when 
they increase (Fig. 1(a) - (f)). The homogeneity, separa- 
tion, Dunn Index, average silhouette width and Hubert 
correlation (i.e. Hubert's gamma) tend to be higher for 
non-orthogonal MFs than results from orthogonal MFs 
and 7<T-means algorithm. The GAP statistic is lower for 
non-orthogonal MFs than orthogonal MFs and K- 
means. But, Pearson correlation of cophenetic distance 
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Figure 1 Illustration of various measures. Illustration of various measures. Here, we evaluated seven methods by six measures. Each illustration 

shows results from various measures such as (a) Homogeneity, (b) separation, (c) Dunn Index, (d) average silhouette width, (e) Pearson 

correlation of cophenetic distance, (f) Hubert gamma and (g) GAP statistic. GAP statistic is optimized when it has lower value. But other 

measures which have higher value are optimized. 
I ) 
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has the highest value for SVD (Fig. 1(e)). Overall, non- 
orthogonal MFs represented best clustering quality. 

We compared homogeneity with separation at the 
same time (Additional File 1). Results from measures for 
each dataset were clustered. Results from NMF, SNMF 
and BSNMF showed higher slope, that is, their homoge- 
neity and separation are more optimized than others. 
When we compare one of the measures, Hubert correla- 
tion of cophenetic distance between MFs, at each num- 
ber of clusters (Additional File 2), NMF, SNMF and 
BSNMF showed better performance than others in four 
datasets except for the Leukemia dataset. ICA has the 
highest value for the Leukemia dataset. Overall, non- 
negative MFs have best clustering quality. 

The three datasets, Leukemia, Medulloblastoma and 
Iris datasets have known class labels as 'gold standards'. 
For the three datasets, we measured accuracy or predic- 
tive power using the adjusted Rand Index and prediction 
accuracy. Fig. 2 shows the adjusted Rand Index for the 
correct classification for the three datasets with the 
seven methods (i.e. six MFs and 7<T- means method). The 
Leukemia dataset was evaluated at both two-class (i.e. 
AML vs. ALL, Fig. 2(a)) and three-class (i.e. AML vs. T 
cell type vs. B cell type, Fig. 2(b)) levels. Fig. 2 demon- 
strates that BSNMF, SNMF and NMF have the highest 
Adjusted Rand Index for most of the evaluations. 

Fig. 3 shows the results from prediction accuracy. 
SNMF and BSNMF tend to show the best accuracy 
measures. We also included a voting scheme that simply 
combines all the results from the various algorithms and 
returns the best consensus. Voting showed comparable 
results to SNMF and BSNMF. 

Detailed class prediction results for the Leukemia 
dataset are shown in Table 1. Class assignment is opti- 
mized for each dataset when accuracy has the highest 
value. All methods were tested both at K=2 and K=3. 
At K=2 level, one AML sample (AML_12) was incor- 
rectly assigned to ALL by SNMF and BSNMF. The 
result is the same as that of Gao et al. [14]. The error 
count for NMF was two (ALL_7092_B cell and 
ALL_14749_B cell). Overall, non-orthogonal MFs like 
BSNMF, SNMF, NMF and ICA showed higher predic- 
tion accuracy than orthogonal MFs and 7<T-means algo- 
rithm. At K=3 level, BSNMF showed the best results 
with only one mistake that AML_13 was incorrectly 
assigned to ALL, while SNMF made two mistakes 
(AML_13 and ALL_21302_B cell). Table 2 shows the 
results for the Medulloblastoma dataset K=2. BSNMF 
showed the best result with 11 mistakes, while SNMF 
and NMF have 13 and ICA has 14. 

Evaluation of biological relevance 

To evaluate the biological relevance of the clustering 
results, we created clusters of genes and assigned them 



to the corresponding sample-wise clusters. For MFs, we 
clustered genes by using coefficient matrix of genes. For 
instance, in the Leukemia dataset factorized by NMF at 
7<T=2, we clustered genes into two groups by using the 
coefficient matrix of genes, W, from NMF. Given such a 
factorization, the matrix W is able to be used to deter- 
mine the gene cluster membership, that is, a gene i is 
placed in a cluster ; if the Wg is the largest entry in row 
i. Applying /Omeans algorithm, we clustered genes 
using original gene expression data matrix. Then, we 
labelled gene-cluster corresponding to the labels of sam- 
ple-cluster. 

Gene-wise clusters are annotated by GO terms and 
biological pathways. We measured the significance of 
GO term (or pathway) assignment by using hyper-geo- 
metric distribution. Here we briefly regard each GO 
term and biological pathway as a term. Table 3 shows 
the numbers of significantly enriched terms for the cor- 
responding clusters at p < 0.05. For the Leukemia data- 
set, BSNMF (#=535) and NMF (#=532) have the 
highest numbers of significantly enriched terms in ALL. 
BSNMF has the highest numbers in AML (#=280) and 
in total (#=815) (Table 3(a)). Table 3(b) shows the 
results from Medulloblastoma dataset. In cluster 1, 
BSNMF (#=599) and /C-means (#=517) have the most 
significantly enriched terms. In cluster 2, SVD (#=361) 
and NMF (#=335) have the most terms. The total num- 
ber of significant terms is the biggest with BSNMF 
(#=805). Table 3(c) demonstrates that the fibroblast 
dataset has the biggest total number of significant terms 
for BSNMF (#=504). Table 3(d) exhibits the result from 
the mouse dataset. In cluster 1, BSNMF (#=690) and 
SNMF (#=686) have the most significantly enriched 
terms. In cluster 2, ICA (#=114) has the most terms. 
The total number of significant terms is the biggest with 
BSNMF (#=746). Overall, the numbers of significantly 
enriched terms resulting from non-orthogonal MFs, 
BSNMF, SNMF, NMF and ICA, are bigger than those of 
orthogonal MFs and 7<T-means algorithm. 

Dueck et al. [20] summarized GO terms with signifi- 
cance to the resulting clusters from various clustering 
algorithms using two representations: the proportion of 
factors that are significantly enriched for at least one 
functional category at a=0.05 and the mean log 10 (p- 
value). We combined two representations. We calcu- 
lated the weighted ^-values, the proportion of significant 
GO terms multiplies the negative log 10 (p-vdlue). Fig. 4 
shows the weighted ^-values of the GO terms signifi- 
cantly annotated to the corresponding clusters for the 
Leukemia and Medulloblastoma datasets. The weighted 
^-values are more significant when they have higher 
value. For simplicity, we plotted the top 50 terms. Plots 
for other dataset can be found in the supplement web 
site (http://www.snubi.org/software/BSNMF/). For the 
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Figure 2 Illustration of the Adjusted Rand index. Illustration of the Adjusted Rand index, (a) Result from leukemia dataset which has known 
class labels with two groups, ALL and AML, We tested various methods at rank k=2. (b) From leukemia dataset with three groups, ALL-B, ALL-T 
and AML. We applied the adjusted Rand index at rank k=3. (c) From medulloblastoma dataset which has known class labels with two groups, 
classic and desmoplastic. (d) From iris dataset that has known class labels with three groups of flower species. 



Leukemia dataset, BSNMF and 7<T-means were shown to 
have annotated terms with the highest significance in 
AML and BSNMF and SNMF in ALL (Fig. 4(a), (b)). 
Overall, BSNMF and SNMF showed the highest signifi- 
cance for the whole Leukemia dataset (Fig. 4(c)). In the 
medullobalstoma dataset, BSNMF and 7<T-means for the 
first cluster and BSNMF and SVD for the second cluster 
had the higher weighted Rvalue than other methods. 



Overall, BSNMF showed the best results (Fig. 4(d) - (f)). 
Therefore, genes in the clusters created by BSNMF 
seemed to be more biologically associated in terms of 
GO term annotations than those created by other 
methods. 

The ^-values are calculated for each GO category and 
for each pathway resource (Fig. 5). The GO term (or 
pathway) annotation having lower ^-values represents 
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Figure 3 Illustrations of accuracy. Illustrations of accuracy. It measures prediction power of clustering. 
Leukemia dataset, Medulloblastoma dataset and Iris dataset which have known labels of sample-class. 



ar plot of accuracy from three dataset, 



that corresponding cluster in terms of sharing GO terms 
(or pathways) is more relevant biologically. The result 
for /Omeans and BSNMF in the AML cluster is only 
shown. Other results are found in the supplement web 
site. Overall, non-orthogonal MFs tend to create more 
enriched clusters. 

The top- ranked genes with the largest coefficient in 
W matrix of BSNMF may be most explanatory for each 
cluster (Additional File 3). The top ranked 20 genes for 
the ALL cluster are enriched in significant GO terms 
like response to external stimulus, immune response 
and cell growth. Genes for the AML cluster had are 
enriched in response to external stimulus, immune 
response and membrane genes. The gene functions in 
PubMed indicated that the two sets of 20 genes are 
enriched in chemokines and tumor suppressor genes. 
Genes for the first cluster of meduloblastoma were 
related to cytoplasm, cell motility and cell growth and/ 
or maintenance and those for the second cluster to 
cytoplasm, biosynthesis and protein metabolism genes. 
Gene sets for other datasets can be found in the supple- 
ment web site. 

The mean expression profiles of the gene-wise clusters 
from the fibroblast dataset were extracted (Additional 
File 4). We clustered genes by using coefficient matrix 
of genes when we applied MFs. Coefficient matrix of 



genes (W matrix) can be used to determine cluster 
membership of genes, that is, gene i belongs to cluster / 
if the Wij is the largest entry in row i. Applying 7<T-means 
algorithm, we clustered genes using original gene 
expression data matrix. Then, we labelled gene-cluster 
corresponding to the labels of sample-cluster. According 
to method mentioned above, gene-wise clusters were 
created by the seven methods. Number of gene-wise 
clusters is five because Xu et al. [21] and Sharan et al. 
[18] suggested that optimal number of clusters is five 
from the fibroblast dataset. While 7<T-means, SVD and 
PCA tend to result a few clusters with dominant profiles 
with the remaining clusters with relatively flat profiles, 
non-orthogonal MFs tend to create clusters with even 
dominance. For example, SVD result shows one major 
peak and BSNMF result shows much more peaks. Non- 
orthogonal MFs seem to be more effective in discover- 
ing significant patterns. 

Discussion 

There are various clustering-based methods which are 
proposed by many researchers. These methods have 
become a major tool for gene expression data analysis. 
Different clustering-based methods usually produce dif- 
ferent solutions and one or a few preferred solutions 
among them should be selected. However, a systematic 
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Table 1 Class Assignment of Acute Myelogenous Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL) 

Kmeans *SVD *PCA *ICA *NMF *SNMF *BSNMF Voting 
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M 


AML_3 


M 


M 


L 


T 




M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


AML 5 


L 


B 


M 


B 




M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


AML_6 


L 


B 


L 


M 




M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


AML_7 


M 


M 


L 


T 




M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


M 


Error Count 


7 


7 


18 


20 


16 


18 


3 


5 


2 


3 


1 


2 


1 


1 


1 


1 



Class Assignment of Acute Myelogenous Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL) at K-2 and K-3. 

* SVD: singular value decomposition, PCA: principal component analysis, ICA: independent component analysis, NMF: non-negative matrix factorization, SNMF: 
sparse non-negative matrix factorization, 

BSNMF: bi-directional non-negative matrix factorization, Voting: Voting class 
** L: ALL, M: AML, B: ALL_B cell, T: ALL_T cell 
Bold-faced: misclassified samples 

evaluation study for the methods has not been reported. 
Therefore, we evaluated orthogonal (i.e. PCA, SVD), 
non-orthogonal (i.e. ICA, NMF and SNMF) MFs and a 
traditional clustering algorithm (i.e. 7<T-means) using 
seven clustering-quality (i.e. homogeneity, separation, 
Dunn index, average silhouette width, Pearson correla- 
tion of cophenetic distance, Hubert correlation of 
cophenetic distance and the GAP statistic) and two pre- 
diction-accuracy measures (i.e. the adjusted Rand index 
and prediction accuracy) applying to five published data- 
sets. We also included an improving non-orthogonal 
MFs, BSNMF in the evaluation study. 

As a result, we observed that clustering quality and 
prediction-accuracy indices applying non-orthogonal 



MFs are better than those of orthogonal MFs and K- 
means. In respect to results from Homogeneity, separa- 
tion, Dunn index, average silhouette width and Hubert 
correlation of cophenetic distance, non-orthogonal MFs 
had higher value than those of orthogonal MFs and K- 
means. The GAP statistic was lower for non-orthogonal 
MFs than for orthogonal MFs and 7<T-means. When we 
tested predictive accuracy for the three datasets with 
known class labels, we also observed better performance 
for non-orthogonal MFs than for the rest. We also 
investigated the biological significance of clustering 
genes because it is important to discover biological rele- 
vant patterns and interpret biologically for analysis of 
DNA microarray gene expression data. When we used 
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Table 2 Class assignment for Medulloblastoma dataset 



□ Sample 


Subgroup 


Kmeans 


*SVD 


*PCA 


ICA 


*NMF 


*SNMF 


*BSNMF 


*Voting 


Brain_MD_7 


classic 


**2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_59 


classic 


**1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_20 


classic 


1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_21 


classic 


1 


2 


2 


1 


1 


1 


1 


1 


Brain_MD_50 


classic 


1 


1 


2 


2 


1 


1 


1 


1 


Brain_MD_49 


classic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_45 


classic 


1 


2 


2 


1 


1 


1 


1 


1 


Brain_MD_43 


classic 


1 


2 


2 


1 


1 


1 


1 


1 


Brain_MD_8 


classic 


1 


1 


2 


1 


1 


1 


1 


1 


Brain_MD_42 


classic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_1 


classic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_4 


classic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_55 


classic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_41 


classic 


1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_37 


classic 


1 


2 


1 


1 


1 


1 


1 


1 


Brain_MD_3 


classic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_34 


classic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_29 


classic 


1 


2 


1 


1 


1 


1 


1 


1 


Brain_MD_13 


classic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_24 


classic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_65 


classic 


1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_5 


classic 


1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_66 


classic 


1 


2 


1 


1 


1 


1 


1 


1 


Brain_MD_67 


classic 


1 


1 


1 


1 


1 


1 


1 


1 


Brain_MD_58 


classic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_53 


desmoplastic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_56 


desmoplastic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_16 


desmoplastic 


2 


1 


2 


2 


2 


2 


2 


2 


Brain_MD_40 


desmoplastic 


1 


1 


2 


1 


1 


2 


2 


1 


Brain_MD_35 


desmoplastic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_30 


desmoplastic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_23 


desmoplastic 


2 


2 


2 


2 


2 


2 


2 


2 


Brain_MD_28 


desmoplastic 


1 


1 


2 


2 


2 


1 


2 


2 


Brain_MD_60 


desmoplastic 


1 


2 


2 


1 


1 


1 


2 


2 


Error Count 




14 


16 


16 


14 


13 


13 


11 


12 



Class assignment for Medulloblastoma dataset at K-2 

* SVD: singular value decomposition, PCA: principal component analysis, 

ICA: independent component analysis, NMF: non-negative matrix factorization, 

SNMF: sparse non-negative matrix factorization, 

BSNMF: bi-directional non-negative matrix factorization, Voting: Voting class 
** 1: classic type, 2: desmoplastic type 
Bold-faced: misclassified samples 

enrichment analysis with GO terms and biological path- 
ways, we obtained more significant enriched GO terms 
or pathways for non-orthogonal MFs than for orthogo- 
nal MFs and /Omeans. We also identified genes that 
may be dominantly involved in each subtype. It was 
demonstrated that BSNMF showed improved perfor- 
mance in prediction-accuracy and biological-enrichment 
measures, outperforming other non-orthogonal MFs as 
well as orthogonal MFs and 7<T-means algorithm. 

There are various clustering evaluation indices we 
mentioned. Because they have various results upon data- 
sets, they have limitations to suggest which clustering- 
based method is the best. Therefore, improving cluster 
validation indices is needed to overcome it. We simply 
suggested a voting scheme that simply combines all the 
results from the various algorithms and returns the best 
consensus. Improving evaluation indices can be achieved 



through the integration of results from various evalua- 
tion indices using unifying rules. 

Conclusions 

In conclusion, the clustering performance of orthogonal 
and non-orthogonal MFs was appropriately compared 
for clustering microarray data using various measure- 
ments. We clearly showed that non-orthogonal MFs 
have better performance than orthogonal MFs and K- 
means for clustering microarray data. The characteristic 
difference among non-orthogonal MFs, orthogonal MFs 
and 7<T-means algorithm implies that non-orthogonal 
MFs divided whole data into distinct patterns more 
evenly than orthogonal MFs and 7<T-means. This study 
would help for suitably evaluating diverse clustering 
methods in other genome-wide data as well as microar- 
ray data. 
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Table 3 Number of significantly enriched GO terms (or 
pathways) 



(a) Leukemia dataset 



□ 


Kmeans 


SVD 


PCA 


ICA 


NMF 


SNMF 


BSNMF 


ALL 


480 


389 


441 


453 


532 


425 


535 


AML 


85 


262 


223 


222 


167 


266 


280 


Total 


565 


651 


664 


675 


699 


691 


815 


(b) Medulloblastoma dataset 


□ 


Kmeans 


SVD 


PCA 


ICA 


NMF 


SNMF 


BSNMF 


classic 


517 


373 


467 


479 


388 


456 


599 


desmoplastic 


58 


361 


226 


213 


335 


208 


206 


Total 


575 


734 


693 


692 


723 


664 


805 


(c) Fibroblast dataset 


□ 


kmeans 


SVD 


PCA 


ICA 


NMF 


SNMF 


BSNMF 


clusterl 


52 


45 


71 


47 


57 


41 


128 


cluster2 


32 


35 


68 


27 


54 


48 


69 


cluster3 


48 


24 


63 


61 


37 


75 


50 


cluster4 


126 


38 


37 


96 


108 


60 


155 


cluster5 


54 


63 


60 


33 


65 


68 


102 


Total 


312 


205 


299 


264 


321 


292 


504 


(d) Mouse dataset 


□ 


kmeans 


SVD 


PCA 


ICA 


NMF 


SNMF 


BSNMF 


clusterl 


593 


520 


294 


258 


637 


686 


690 


cluster2 


27 


61 


107 


114 


38 


28 


56 


Total 


620 


581 


401 


372 


675 


714 


746 



Number of significantly enriched terms at a=0.05 



Methods 

Dataset 

For the evaluation study, we used five published data- 
sets. The Leukemia data set [22] has 38 bone marrow 
samples and 5000 genes after filtering process applied 
by Brunet et al.[9]. Acute myelogenous Leukemia 
(AML) and acute lymphoblastic leukemia (ALL) are dis- 
tinguished as well as ALL can be divided into T and B 
cell subtypes. The second is Medulloblastoma dataset 
that is a gene expression profiles from the childhood 
brain tumors. Although the pathogenesis of the tumor is 
not well understood, it can be categorized into two 
known histological subclasses: classic and desmoplastic. 
Pomeroy et al [23] demonstrated the correlation of gene 
expression profiles and the two histological classes. The 
dataset has 34 samples and over 5800 genes. The third 
is the gene expression dataset from Zhang et al (2004, 
http://hugheslab.med.utoronto.ca/Zhang). This dataset 
contains gene expression profiles of over 40000 known 
as well as predicted genes in 55 mouse tissues, organs 
and cell types. We used over 8200 genes after filtering 
with low variance. The forth is the human fibroblast 
gene-expression dataset from Iyer et al[24] with 18 sam- 
ples and 517 genes. The last is the well-known Iris data- 
set [25]. This famous dataset gives the measurements in 



centimetres of the length and width of sepal and petal, 
respectively, for 50 flowers from each of the three spe- 
cies of Iris (i.e. Iris setosa, versicolor and virginica). 
Among datasets, Leukemia, Medulloblastoma and Iris 
dataset have known class labels for samples, while the 
rest have not. 

Non-orthogonal matrix factorization for gene expression 
analysis 

The gene-expression data is typically represented as an 
N-by-M matrix A. In the matrix, each row represents 
the expression values of a gene across all samples. Each 
column represents the expression values of all genes in 
a sample. NMF can decompose gene expression data 
and derive parts-based representation of the whole data. 
It factorizes a matrix A into the product of two 
matrices, including non-negative entries, formulated as 
A = WH. W and H are N-by-K and K-by-M matrices, 
respectively, and K is much smaller than M. The col- 
umn of W can be regarded as a metagene, consisting of 
elements w t . Each element represents the coefficient of 
gene i in metagene The columns of matrix H repre- 
sent the metagene expression pattern of the correspond- 
ing sample. Each element h t j indicates the expression 
value of metagene i in sample The cluster membership 
can be determined based on such a factorization of the 
matrix H. Sample j belongs to cluster i if the h t j is the 
largest entry in column 

Brunet et al [9] represented parts corresponding to 
metagenes which represent genes tend to be co- 
expressed in samples. Here parts mean sets of elements, 
indicating the building blocks for the whole. These 
metagenes can overlap, indicating that a single gene 
may be involved in a number of pathways or biological 
processes. Therefore, sparseness constraints are needed. 
NMF with sparseness constraints has been proposed by 
a few groups. Gao and Church [14] proposed a method 
to enforce sparseness of H matrix by penalizing the 
number of non-zero entries. This method enforces spar- 
seness by combining the goal of minimizing reconstruc- 
tion error with that of sparseness [14]. Specifically, they 
adopt the point-count regularization approach that 
enforces sparseness of H by penalizing the number of 
non-zero entries rather than the sum of entries Yhij in 
H [11,12,15]. The sparseness is controlled by the para- 
meter and larger parameter makes the H matrix become 
more and more sparse. Here, the optimization leads the 
resulting H matrix to contain as many zero entries as 
possible. Gao's method enforces sparseness to H matrix 
only. We applied the sparseness constraints bi-direction- 
ally to both W and H. Because microarray gene expres- 
sion data analysis involves clustering by genes as well as 
by samples. In microarray data analysis, sample-based 
clustering can be used to classify samples with similar 
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Figure 4 Weighted p-value of significantly enriched GO terms. Weighted p-value of significantly enriched GO terms, (a) and (b) represent 
result of ALL and AML cluster in leukemia dataset. (d) and (e) show result of cluster 1 (assigned to classic type) and cluster 2 (assigned to 
desmoplastic type) in medulloblastoma dataset. Among the entire significantly enriched factors, top 50 factors are represented, (c) and (f) 
represent result of top 50 factors in each entire dataset. Results from other dataset are shown in supplementary site. 



appearance while gene-based clustering can provide 
informative sets of tightly co-regulated genes and infor- 
mation about activity of genes. While sample-based 
clustering relies on metagenes, gene-based clustering 
relies on meta-samples. The two processes can be 
viewed as bi-directionally constrained with each other. 
Good metagene may support good sample-based clus- 
ters and vice versa. Optimizing sample-dimension only, 
sparseness of gene-dimension is relatively decreased 
when sparseness of sample-dimension is increased. In 
result, the minimization problem is convex that was 
subsequently described by others [11,12,14,15] and the 
resulting matrix cannot support gene-based clusters 
well. Therefore, optimizing both sample and gene 
dimension together may be appropriated for clustering 
of microarray data. This method can optimize both sam- 
ple and gene clustering. In this paper, we especially 
focus on BSNMF (Bi-directional Sparseness Non-nega- 
tive matrix factorization). The definition and algorithm 
is described below. 



Definition: Bi-directional Sparseness Non-negative 
matrix factorization (BSNMF) 

Given a non-negative gene expression data V of size 
N-by-M, find the non-negative matrices W and H of 
size N-by-C and C-by-M (respectively) such that 

E(W, H) = ||V-WH|| 2 

is minimized, under optional constraints: 

Sparseness (wj) = Xi 

Sparseness (hi) = X 2 , 

where w t is the i th column of W and h t is the i th row 
of H. Here, C denotes the number of components 
(metagenes), X x and X 2 are the desired sparseness of W 
and H, respectively. These three parameters are set by 
the experimenters. 

Algorithm: Bi-directional Sparseness Non-negative 
matrix factorization (BSNMF) 

1. Initialize W and H to random positive matrices of 
dimension N-by-C and C-by-M, respectively. 

2. Rescale the column of W and the row of H to unit 
norm. 



Kim et al. BMC Bioinformatics 201 1, 12(Suppl 13):S8 
http://www.biomedcentral.com/1471-2105/12/S13/S8 



Page 11 of 14 



R - 





i 



n n 



i ° 

^ 8 



Figure 5 Log scaled p-values for significantly enriched factors. Log scaled p-values for significantly enriched factors. Each plot represents 
significantly enriched terms (at a=0.05) at AML cluster in leukemia dataset using (a) K-means and (b) BSNMF. x-axis represents log 10 (p-value). 
Entire factors were divided into five categories, GO term of biological process (BP), GO term of cellular component (CC), GO term of molecular 
function (MF), BIOCARTA, and pathway of KEGG. 



3. Iterate until convergence or reach maximum num- 
ber of allowed iteration. 
(1) If sparseness constraints on H apply 

a. solve W iia+1) = W^VH T W(WHH T ) /a 

b. Rescale the column of W to unit norm 

c. Solve for each / 

min {l/2||V r WH y || 2 + 1/2^| |H ; | | 2 } 



d. if (H, y <0) then H* 7 - =0 

(2) If sparseness constraints on W apply, 

a. solve H iia+1) = HjW T V)J(W T WH), fl 

b. Rescale the column of H to unit norm 

c. Solve for each ; 

min {l/2||V r WH y || 2 + 1/2X 2 \ \Wj\ | 2 } 

d. if (W, y <0) then W tJ =0 
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Measures of clustering evaluation 

In this study, we attempt to evaluate various MFs using 
cluster evaluation indices. Here, we briefly introduce 
cluster evaluation indices we used. 

Compactness The first measures estimate cluster com- 
pactness or homogeneity with intra cluster variance. Many 
variants of between-cluster homogeneity measure are able 
to estimate average or maximum pair wise between- cluster 
distances, average or maximum centroid-based similarities 
or the use of graph-based approaches [26]. For this pur- 
pose, we used the homogeneity index by Sharan and Sha- 
mir. Homogeneity index is defined as: 



H avg =^^Corr(C(x),C(M)). 



xeC 

In this equation, C is a cluster. C(M) is the cluster 
centroid and C(x) is each data item. Corr(C(x), C(M)) is 
the correlation coefficient between each data item and 
the centroid. N is the number of data items. 

Separation The second index quantifies the degree of 
separation between individual clusters. For example, the 
average weighted within-cluster distances define an 
overall rating for a partitioning, where the distance 
between individual clusters can be calculated as the dis- 
tance between cluster centroids, or as the minimum dis- 
tance between data items belonging to different clusters. 
Alternatively, we used cluster separation in a partition- 
ing which may be estimated as the minimum separation 
observed between individual clusters in the partitioning. 
Separation is defined as: 

Separation = min (mm(dist(Ck, C/))). 

C k eC QeC 

Where dist{C h Q) is the minimum distance between a 
pair of data items, i and with i e C k and ; e Q. 

Combinations There are a number of enhanced 
approaches combining measures of the different types of 
cluster evaluation indices. Several methods therefore esti- 
mate both between-cluster homogeneity and within-cluster 
separation. They compute a resulting score by combining 
linearly or non-linearly the two measures. A well-known 
linear combination is the SD-validity Index [27] and non- 
linear combinations include the Dunn Index [28], Dunn- 
like-Indices [26], the Davies-Bouldin Index [29] and the sil- 
houette width [30] . We used Dunn Index and average sil- 
houette width. The Dunn Index measures the ratio 
between the smallest cluster distance and the largest 
between-cluster distance in a partitioning. It is defined as: 



D(C) 



■■ mm 

C fe GC 



distiCk, Cl) 

mm — 

c,gc max diam{Cm) 

CmeC 



where diam(C m ) is the maximum intra-cluster dis- 
tance within cluster C m and dist{C h Ci) is the minimum 
distance between pairs of data items, i and with i e 
and j g Q. The interval of Dunn Index is [0, +°o] 
and it should be maximized. 

The silhouette width for a partitioning is computed as 
the average silhouette value over all data items [30]. For 
each observation /, the silhouette width s(i) is defined 
as 



s(i) = 



b{i) - a{i) 
max(a(i),fr(i)) 



where a(i) is the average dissimilarity between i and 
all other points of the cluster to which i belongs. b(i) is 
the average dissimilarity of i between all observations in 
its neighbour cluster. A large s(i) means that data items 
are "well clustered". 

Compliance between partitioning and distance 
information An alternative way of estimating cluster 
validity is to directly assess the degree to which distance 
information in the original data is consistent with a par- 
titioning. For that purpose, a partitioning can be repre- 
sented by means of its cophenetic matrix [31], of which 
each entry C(/, j) indicates whether the two elements, i 
and j are assigned to the same cluster or not. In hier- 
archical clustering, the cophenetic distance between two 
observations is defined as the inter-group dissimilarity 
at which two observations are first joined in the same 
cluster. The cophenetic matrix can be compared with 
the original dissimilarity matrix using Hubert's correla- 
tion, the normalized gamma statistic, or a measure of 
correlation such as the Pearson [32] or Spearman's rank 
correlation [33]. We used Hubert's and Pearson corre- 
lations. The definition of the Huberts correlation is 
given by the equation: 

N-l N 



i=l j=i+l 

where M - N(N-l)/2, P is the proximity matrix of the 
data set and Q is an N-by-N matrix of which (/, ;) ele- 
ment represents the distance between the representative 
points yv c ,,v c _ j of the clusters where the objects x t 
and Xj belong. 

Number of clusters Most of the internal measures 
discussed above can be used to assess the number of 
clusters. If both clustering algorithms employed and the 
internal measures are satisfactory for the dataset under 
consideration, the best number of clusters can be 
obtained by a knee in the resulting performance curve. 
To measure whether the 'optimal' number of clusters is 
found, we used Gap Statistic [34]: 
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Gap(k)=^w; b -\og(W k ). 



K is the total number of clusters giving within disper- 
sion measures W k , k = 1,2,..., K. The Gap statistic 
should be minimized to find the 'optimal' number of 
clusters. 

Predictive power and accuracy A number of indices 
can assess agreement between a partitioning and the 
gold standard by observing the contingency table of the 
pair wise assignment of the data items. The well-known 
index is the Rand Index [35], which determines the 
similarity between two partitions by penalizing false 
positive and false negative. There are a number of varia- 
tions in Rand Index. In particular, the adjusted Rand 
Index [36] introduces a statistically induced normaliza- 
tion to yield values close to zero for random partitions. 
Another related indices are the Jaccard coefficient [37] 
and the Minkowski Score [38]. We used the adjusted 
Rand Index to estimate the similarity between clustering 
results and the known class labels. The Adjusted Rand 
Index is defined as: 



R(U,V)= — 



[ ]T (n ; C2) + ^ (n fe C2) ] - [ £ [nfil) ] / nCl 



where n ik denotes the number of data items assigned 
to both cluster / and cluster k. The Adjusted Rand 
Index has a value in the interval [0, 1] and is to be 
maximized. 

The accuracy of clustering is measured by the fol- 
lowing formula [39]: 



AC=-^ 

n 

where / (/;) is 1 if the cluster assignment is correct for 
sample j it otherwise 0 if the cluster assignment is 
incorrect. 

Biological enrichment analysis 

We applied biological enrichment analysis to clustering 
results in order to assess whether functionally related 
genes are grouped. The resulting genes from clustering 
are then subdivided into functional categories for biolo- 
gical interpretation. Such functional categorization was 
accomplished using GO terms and biological pathways. 
We used DAVID 2.1 (http://david.abcc.ncifcrf.gov/) for 
GO term enrichment analysis and ArrayXPath [40,41] 
for pathway annotation. A modified Fisher's exact test is 
performed to determine whether the proportions of 
members falling into each category differ by group, 



when those in two independent groups fell into one of 
the two mutually exclusive categories. Therefore, lower 
j?-value indicates a better association of cluster 
members. 

Additional material 



Additional file 1: Illustration of separation vs. homogeneity 

Illustration of separation vs. homogeneity. Results from each dataset are 
gathered. Each color means each method. Results from NMF, SNMF and 
BSNMF have higher slope. That is, homogeneity and separation are more 
optimized. 

Additional file 2: Illustration of Hubert gamma Illustration of Hubert 
gamma. It is a measure of compliance between partitioning and distance 
information. Each plot shows result from each datasets at rank K-2, 3, 4 
(for Iris dataset) or K=2, 3, 4 and 5 (for the rest), (a) Leukemia dataset (b) 
medulloblastoma dataset (c) Iris dataset (d) fibroblast dataset (e) Mouse 
dataset. 

Additional file 3: The twenty common genes in each leukemia 
subtype The twenty common genes in each leukemia subtype 

Additional file 4: Patterns of mean expression level for each cluster 
for fibroblast dataset Patterns of mean expression level for each cluster 
for fibroblast dataset. (a) /(-means, (b) SVD, (c) PCA, (d) ICA, (e) NMF, (f) 
SNMF and (g) BSNMF. Each lines represent for each cluster. 
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